Predict Default of Credit Card Clients

Predict Default of Credit Card Clients by customer last months data

2. Dataset

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. There are 25 variables:

  • ID: ID of each client

  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit

  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years
  • PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, …
  • 8=payment delay for eight months, 9=payment delay for nine months and above)
  • PAY_2: Repayment status in August, 2005 (scale same as above)
  • PAY_3: Repayment status in July, 2005 (scale same as above)
  • PAY_4: Repayment status in June, 2005 (scale same as above)
  • PAY_5: Repayment status in May, 2005 (scale same as above)
  • PAY_6: Repayment status in April, 2005 (scale same as above)
  • BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
  • BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
  • BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
  • BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
  • BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
  • BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
  • PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
  • PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
  • PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
  • PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
  • PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
  • PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
  • default.payment.next.month: Default payment (1=yes, 0=no)
In [ ]:
 

3.1 Data Explore

  • Load the liberaries
  • Load the data
  • correct data "data types"
  • Check duplication
  • Check missing data
  • Check NAN/zero data
  • Check data statstics
  • check unique values for the catogorical columns
  • Check the data ditrebution for the catogorical data
  • Check data balance
  • Check data features correlation
In [1]:
import pandas as pd
import numpy as np                                
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.offline as py
py.init_notebook_mode(connected=True)
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
import cufflinks as cf
cf.go_offline()
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support
C:\Users\ws250158\AppData\Local\Continuum\anaconda3\lib\site-packages\statsmodels\tools\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
In [2]:
import pandas as pd # package for high-performance, easy-to-use data structures and data analysis
import numpy as np # fundamental package for scientific computing with Python
import matplotlib
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import plotly.offline as py
py.init_notebook_mode(connected=True)
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
# from plotly import tools
# import plotly.tools as tls
# import squarify
# from mpl_toolkits.basemap import Basemap
# from numpy import array
# from matplotlib import cm

# import cufflinks and offline mode
import cufflinks as cf
cf.go_offline()
In [3]:
#Read the data from xls file with sheet nema : Data
data=pd.read_excel('default-of-credit-card-clients.xls', 'Data', index_col=0, na_values=['NA'],)
In [4]:
new_header =data.iloc[0]
data = data[1:] 
data.columns = new_header
In [5]:
data.head()
Out[5]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 20000 female university married 24 2 2 -1 -1 -2 ... 0 0 0 0 689 0 0 0 0 1
2 120000 female university single 26 -1 2 0 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 90000 female university single 34 0 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 50000 female university married 37 0 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
5 50000 male university married 57 -1 0 -1 0 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 24 columns

In [6]:
print('Number of rows '+ str(data.shape[0]))
print('Number of cloumns '+ str(data.shape[1]))
Number of rows 30000
Number of cloumns 24
In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   LIMIT_BAL                   30000 non-null  object
 1   SEX                         30000 non-null  object
 2   EDUCATION                   29669 non-null  object
 3   MARRIAGE                    29677 non-null  object
 4   AGE                         30000 non-null  object
 5   PAY_0                       30000 non-null  object
 6   PAY_2                       30000 non-null  object
 7   PAY_3                       30000 non-null  object
 8   PAY_4                       30000 non-null  object
 9   PAY_5                       30000 non-null  object
 10  PAY_6                       30000 non-null  object
 11  BILL_AMT1                   30000 non-null  object
 12  BILL_AMT2                   30000 non-null  object
 13  BILL_AMT3                   30000 non-null  object
 14  BILL_AMT4                   30000 non-null  object
 15  BILL_AMT5                   30000 non-null  object
 16  BILL_AMT6                   30000 non-null  object
 17  PAY_AMT1                    30000 non-null  object
 18  PAY_AMT2                    30000 non-null  object
 19  PAY_AMT3                    30000 non-null  object
 20  PAY_AMT4                    30000 non-null  object
 21  PAY_AMT5                    30000 non-null  object
 22  PAY_AMT6                    30000 non-null  object
 23  default payment next month  30000 non-null  object
dtypes: object(24)
memory usage: 5.7+ MB
In [8]:
#correct data "data types"
data = data.astype({"PAY_0":'float',"PAY_2":'float',"PAY_3":'float',"PAY_4":'float',"PAY_5":'float',"PAY_6":'float',"AGE":'float', "BILL_AMT1":'float',"BILL_AMT2":'float',"BILL_AMT3":'float',
                      "BILL_AMT4":'float',"BILL_AMT5":'float',"BILL_AMT6":'float',"PAY_AMT1":'float',
                      "PAY_AMT2":'float',"PAY_AMT3":'float',"PAY_AMT4":'float',"PAY_AMT5":'float',"PAY_AMT6":'float'
                     ,"LIMIT_BAL":'float',"default payment next month":'float'})
In [9]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   LIMIT_BAL                   30000 non-null  float64
 1   SEX                         30000 non-null  object 
 2   EDUCATION                   29669 non-null  object 
 3   MARRIAGE                    29677 non-null  object 
 4   AGE                         30000 non-null  float64
 5   PAY_0                       30000 non-null  float64
 6   PAY_2                       30000 non-null  float64
 7   PAY_3                       30000 non-null  float64
 8   PAY_4                       30000 non-null  float64
 9   PAY_5                       30000 non-null  float64
 10  PAY_6                       30000 non-null  float64
 11  BILL_AMT1                   30000 non-null  float64
 12  BILL_AMT2                   30000 non-null  float64
 13  BILL_AMT3                   30000 non-null  float64
 14  BILL_AMT4                   30000 non-null  float64
 15  BILL_AMT5                   30000 non-null  float64
 16  BILL_AMT6                   30000 non-null  float64
 17  PAY_AMT1                    30000 non-null  float64
 18  PAY_AMT2                    30000 non-null  float64
 19  PAY_AMT3                    30000 non-null  float64
 20  PAY_AMT4                    30000 non-null  float64
 21  PAY_AMT5                    30000 non-null  float64
 22  PAY_AMT6                    30000 non-null  float64
 23  default payment next month  30000 non-null  float64
dtypes: float64(21), object(3)
memory usage: 5.7+ MB
In [10]:
#check data statstics
data.describe()
Out[10]:
ID LIMIT_BAL AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
count 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 ... 30000.000000 30000.000000 30000.000000 30000.000000 3.000000e+04 30000.00000 30000.000000 30000.000000 30000.000000 30000.000000
mean 167484.322667 35.518833 -0.016700 -0.133767 -0.166200 -0.220667 -0.266200 -0.291100 51223.330900 49179.075167 ... 43262.948967 40311.400967 38871.760400 5663.580500 5.921163e+03 5225.68150 4826.076867 4799.387633 5215.502567 0.221200
std 129747.661567 9.528019 1.123802 1.197186 1.196868 1.169139 1.133187 1.149988 73635.860576 71173.768783 ... 64332.856134 60797.155770 59554.107537 16563.280354 2.304087e+04 17606.96147 15666.159744 15278.305679 17777.465775 0.415062
min 10000.000000 21.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 -165580.000000 -69777.000000 ... -170000.000000 -81334.000000 -339603.000000 0.000000 0.000000e+00 0.00000 0.000000 0.000000 0.000000 0.000000
25% 50000.000000 28.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 3558.750000 2984.750000 ... 2326.750000 1763.000000 1256.000000 1000.000000 8.330000e+02 390.00000 296.000000 252.500000 117.750000 0.000000
50% 140000.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 22381.500000 21200.000000 ... 19052.000000 18104.500000 17071.000000 2100.000000 2.009000e+03 1800.00000 1500.000000 1500.000000 1500.000000 0.000000
75% 240000.000000 41.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 67091.000000 64006.250000 ... 54506.000000 50190.500000 49198.250000 5006.000000 5.000000e+03 4505.00000 4013.250000 4031.500000 4000.000000 0.000000
max 1000000.000000 267.000000 8.000000 8.000000 8.000000 8.000000 8.000000 8.000000 964511.000000 983931.000000 ... 891586.000000 927171.000000 961664.000000 873552.000000 1.684259e+06 896040.00000 621000.000000 426529.000000 528666.000000 1.000000

8 rows × 21 columns

In [11]:
model_features = data.columns.drop('default payment next month')
model_target = 'default payment next month'
In [12]:
data.nunique()
Out[12]:
ID
LIMIT_BAL                        81
SEX                               2
EDUCATION                         5
MARRIAGE                          3
AGE                              63
PAY_0                            11
PAY_2                            11
PAY_3                            11
PAY_4                            11
PAY_5                            10
PAY_6                            10
BILL_AMT1                     22723
BILL_AMT2                     22346
BILL_AMT3                     22026
BILL_AMT4                     21548
BILL_AMT5                     21010
BILL_AMT6                     20604
PAY_AMT1                       7943
PAY_AMT2                       7899
PAY_AMT3                       7518
PAY_AMT4                       6937
PAY_AMT5                       6897
PAY_AMT6                       6939
default payment next month        2
dtype: int64
In [13]:
data.isna().sum()
Out[13]:
ID
LIMIT_BAL                       0
SEX                             0
EDUCATION                     331
MARRIAGE                      323
AGE                             0
PAY_0                           0
PAY_2                           0
PAY_3                           0
PAY_4                           0
PAY_5                           0
PAY_6                           0
BILL_AMT1                       0
BILL_AMT2                       0
BILL_AMT3                       0
BILL_AMT4                       0
BILL_AMT5                       0
BILL_AMT6                       0
PAY_AMT1                        0
PAY_AMT2                        0
PAY_AMT3                        0
PAY_AMT4                        0
PAY_AMT5                        0
PAY_AMT6                        0
default payment next month      0
dtype: int64
In [14]:
#check unique values for the catogorical columns
for column in data.select_dtypes(include=['object']).columns:
    print(data[column].unique())
['female' 'male']
['university' 'graduate school' 'others' 'high school' nan 0]
['married' 'single' nan 0]
In [15]:
#check the duplicaation in the data
duplicates = data[data.duplicated()]
len(duplicates)
Out[15]:
35
In [16]:
#check data balance
data[model_target].value_counts()/data.shape[0]
Out[16]:
0.0    0.7788
1.0    0.2212
Name: default payment next month, dtype: float64
In [17]:
#check the data ditrebution for the catogorical data 
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns')*100)
col_0 % observations
SEX
female 50.0
male 50.0
col_0 % observations
EDUCATION
0 20.0
graduate school 20.0
high school 20.0
others 20.0
university 20.0
col_0 % observations
MARRIAGE
0 33.333333
married 33.333333
single 33.333333

3.1.1 Data Explore - findings

  • the data has 30000 observations and 24 features
  • all data features are object type . so, i have to convert the columns that have numeric values to numeric type to be able to get more insights from the data
  • The catogrical data has NANs and zeros value must be imputed or deleted
  • Data has 35 duplicates rows must be deleted
  • "PAY_" cloums have values not defind in data description which are(0,-2). as per kaggle discussions -2 -> "No consumption" and 0 -> "The use of revolving credit"
  • there is an outlier in Age Feture which is some of records have age more than 100 year
  • There ara record in "BILL_" columns with negative values "which is weired". as per kaggle discussions It is generally possible for a credit card customer to overpay their bill and temporarily carry a negative balance
  • The data has 77% with default payment flag with value 0 and 22% with default payment flag with value 1. which means the data labels imbalanced
In [18]:
#data balance from target prespective
temp = data["default payment next month"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
df.iplot(kind='pie',labels='labels',values='values', title='default payment or not')
In [19]:
temp = data["AGE"]
temp.iplot(kind='histogram', title='Age Distribution')
In [20]:
temp = data["EDUCATION"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
df.iplot(kind='pie',labels='labels',values='values', title='EDUCATION Status ', hole = 0.5)
In [21]:
temp = data["MARRIAGE"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
df.iplot(kind='pie',labels='labels',values='values', title='MARRIAGE Status ', hole = 0.5)
In [22]:
plt.figure(figsize = (40,20))
sns.heatmap(data.corr(),annot = True,square = True)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x23f83562288>
In [ ]:
 
In [23]:
#Remove Duplicated data
clean_data=data.drop_duplicates()
In [24]:
#drop NAN data
clean_data=clean_data.dropna()
In [25]:
#remove the outliers
clean_data=clean_data[clean_data['AGE'] <= 100]
In [26]:
temp = clean_data["AGE"]
temp.iplot(kind='histogram', title='Age Distribution')
In [27]:
#remove the unassigend EDUCATION
clean_data=clean_data[clean_data['EDUCATION'] != 0]
In [28]:
#remove the unassigned MARRIAGE status
clean_data=clean_data[clean_data['MARRIAGE'] != 0]
In [29]:
for column in clean_data.select_dtypes(include=['object']).columns:
    print(clean_data[column].unique())
['female' 'male']
['university' 'graduate school' 'others' 'high school']
['married' 'single']
In [30]:
#convert catogerical data to numerical by applying on-hot encoding
categorical_columns  =['SEX','EDUCATION','MARRIAGE']
data_dummies = pd.get_dummies(clean_data[categorical_columns], drop_first=True)
clean_data = pd.concat([clean_data, data_dummies], axis = 1)
clean_data.drop(categorical_columns,axis=1, inplace=True)
In [31]:
# crearte another data but with appling minmax scaler
from sklearn import preprocessing
minmax= preprocessing.MinMaxScaler().fit(clean_data)
minmax_clean_data=minmax.transform(clean_data)
minmax_clean_data=pd.DataFrame(minmax_clean_data,columns=list(clean_data))
In [32]:
# crearte another data but with upsampling the data to solve imbalanced data
from sklearn.utils import resample
#minmax_Y = minmax_clean_data['default payment next month'].copy()
#minmax_X = minmax_clean_data[features].copy()


maj_data=minmax_clean_data[minmax_clean_data['default payment next month']==0]
min_data=minmax_clean_data[minmax_clean_data['default payment next month']==1]

min_data_oversample=resample(min_data,replace=True,n_samples=22000,random_state=587)

oversample_minmax_clean_data=pd.concat([maj_data,min_data_oversample])
oversample_minmax_clean_data['default payment next month'].value_counts()
Out[32]:
0.0    22728
1.0    22000
Name: default payment next month, dtype: int64
In [33]:
#data balance from target prespective
temp = oversample_minmax_clean_data["default payment next month"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
df.iplot(kind='pie',labels='labels',values='values', title='default payment or not')
In [34]:
features = ['LIMIT_BAL', 'SEX_male', 'EDUCATION_high school','EDUCATION_university','EDUCATION_others', 'MARRIAGE_single', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
In [35]:
#Train/Test data for cleaned dataset
clean_Y = clean_data['default payment next month'].copy()
clean_X = clean_data[features].copy()
clean_X_train, clean_X_test, clean_y_train, clean_y_test = train_test_split(clean_X,clean_Y, test_size=0.20,shuffle=True, random_state=42)
In [36]:
#Train/Test data for cleaned and minmax scaled dataset
minmax_clean_Y = minmax_clean_data['default payment next month'].copy()
minmax_clean_X = minmax_clean_data[features].copy()
minmax_clean_X_train, minmax_clean_X_test, minmax_clean_y_train, minmax_clean_y_test = train_test_split(minmax_clean_X,minmax_clean_Y, test_size=0.20,shuffle=True, random_state=42)
In [37]:
#Train/Test data for cleaned and minmax scaled nand upsampled dataset
oversample_minmax_clean_Y = oversample_minmax_clean_data['default payment next month'].copy()
oversample_minmax_clean_X = oversample_minmax_clean_data[features].copy()
oversample_minmax_clean_X_train, oversample_minmax_clean_X_test, oversample_minmax_clean_y_train, oversample_minmax_clean_y_test = train_test_split(oversample_minmax_clean_X,oversample_minmax_clean_Y, test_size=0.20,shuffle=True, random_state=42)
In [38]:
#DecisionTreeClassifier with tree depth =2 and cleaned data
classifier = DecisionTreeClassifier(max_depth=2, random_state=14) 
# fit the classifier
classifier.fit(clean_X_train, clean_y_train)
# test the predictions
predictions = classifier.predict(clean_X_test)
In [39]:
cf=confusion_matrix(clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8134723884424688
AUC     : 0.6407972643072393
Precision: 0.7116968698517299
Recall: 0.32047477744807124
F1_score: 0.4419437340153453
In [40]:
#DecisionTreeClassifier with tree depth =2 and cleaned-minmax scale data
classifier = DecisionTreeClassifier(max_depth=2, random_state=14) 
# fit the classifier
classifier.fit(minmax_clean_X_train, minmax_clean_y_train)
# test the predictions
predictions = classifier.predict(minmax_clean_X_test)
In [41]:
cf=confusion_matrix(minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8134723884424688
AUC     : 0.6407972643072393
Precision: 0.7116968698517299
Recall: 0.32047477744807124
F1_score: 0.4419437340153453
In [42]:
#DecisionTreeClassifier with tree depth =2 and cleaned-minmax scale and updampled data
classifier = DecisionTreeClassifier(max_depth=2, random_state=14) 
# fit the classifier
classifier.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
# test the predictions
predictions = classifier.predict(oversample_minmax_clean_X_test)
accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
Out[42]:
0.6880169908338922
In [43]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.6880169908338922
AUC     : 0.683840920701148
Precision: 0.7726957726957727
Recall: 0.5108820160366552
F1_score: 0.6150875741277065
In [44]:
#DecisionTreeClassifier with tree depth =32 and cleaned-minmax scale and updampled data
classifier = DecisionTreeClassifier(max_depth=32, random_state=14) 
# fit the classifier
classifier.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
# test the predictions
predictions = classifier.predict(oversample_minmax_clean_X_test)
In [45]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8712273641851107
AUC     : 0.8732910700180261
Precision: 0.8115183246073299
Recall: 0.9587628865979382
F1_score: 0.8790170132325141

5.1.1 Decision Tree -findings

  • applying mimax scaled data not afffect the model accurace . it is the same as orignal data.
  • for original / scaled data more tree depth more overfititng for the model
  • upsampled data do better with more tree depth
  • Maximum Accuracy 0.87
In [46]:
from sklearn import linear_model
In [47]:
#LogisticRegression modle for cleaned data
logreg =linear_model.LogisticRegression()
logreg.fit(clean_X_train, clean_y_train)
# test the predictions
predictions = logreg.predict(minmax_clean_X_test)
accuracy_score(y_true = clean_y_test, y_pred = predictions)
C:\Users\ws250158\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[47]:
0.7695332535476149
In [48]:
cf=confusion_matrix(clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.7695332535476149
AUC     : 0.5231251775899454
Precision: 0.5
Recall: 0.06602373887240356
F1_score: 0.11664482306684143
In [49]:
#LogisticRegression modle for cleaned data
logreg =linear_model.LogisticRegression()
# fit the classifier
logreg.fit(minmax_clean_X_train, minmax_clean_y_train)
# test the predictions
predictions = logreg.predict(minmax_clean_X_test)
C:\Users\ws250158\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [50]:
cf=confusion_matrix(minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8009916224995726
AUC     : 0.6007284401685877
Precision: 0.7119815668202765
Recall: 0.22922848664688428
F1_score: 0.3468013468013468
In [51]:
#LogisticRegression modle for cleaned data
logreg =linear_model.LogisticRegression()
# fit the classifier
logreg.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
# test the predictions
predictions = logreg.predict(oversample_minmax_clean_X_test)
C:\Users\ws250158\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [52]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.6794097920858484
AUC     : 0.6778725214185891
Precision: 0.6936610608020699
Recall: 0.6142038946162658
F1_score: 0.6515188335358445

5.2.1 Logistic Regression - Findings

  • mimax scaled data same model accurace as orignal data.
  • upsampled data did not do well with logistic regression model. my be i need to adjust few hyper parameters!!
  • Maximum accuracy 0.8
In [53]:
from sklearn.neighbors import KNeighborsClassifier 
In [54]:
#LogisticRegression modle for cleaned data
knn=KNeighborsClassifier(n_neighbors=10)
knn.fit(clean_X_train, clean_y_train)
# test the predictions
predictions = knn.predict(clean_X_test)
In [55]:
cf=confusion_matrix(clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.764233202256796
AUC     : 0.5251380009849443
Precision: 0.43824701195219123
Recall: 0.08160237388724036
F1_score: 0.13758599124452783
In [56]:
#LogisticRegression modle for cleaned data
knn=KNeighborsClassifier(n_neighbors=10)
knn.fit(minmax_clean_X_train, minmax_clean_y_train)
# test the predictions
predictions = knn.predict(minmax_clean_X_test)
In [57]:
cf=confusion_matrix(minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.797914173362968
AUC     : 0.6197753944556996
Precision: 0.6351791530944625
Recall: 0.2893175074183976
F1_score: 0.39755351681957185
In [58]:
#LogisticRegression modle for cleaned data
knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
# test the predictions
predictions = knn.predict(oversample_minmax_clean_X_test)
In [59]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.878716744913928
AUC     : 0.8806038838141405
Precision: 0.8222003929273084
Recall: 0.9587628865979382
F1_score: 0.8852459016393442

5.3.1 KNN - Findings

  • MinMax scaled data work better with KNN than original data
  • for original data and MinMax scaled data , more neigbors number more effecient prediction
  • but for upsampled data , less neigbors numbers more effecient prediction
  • Maximum Accuracy 0.87
In [60]:
from sklearn.ensemble import RandomForestClassifier
In [61]:
rf=RandomForestClassifier(n_jobs=10,random_state=10, n_estimators=20, verbose=False)
rf.fit(clean_X_train, clean_y_train)
#test the predictions
predictions = rf.predict(clean_X_test)
In [62]:
cf=confusion_matrix(clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8045819798256112
AUC     : 0.6503509441027612
Precision: 0.631917631917632
Recall: 0.3642433234421365
F1_score: 0.4621176470588235
In [63]:
rf=RandomForestClassifier(n_jobs=10,random_state=10, n_estimators=20, verbose=False)
rf.fit(minmax_clean_X_train, minmax_clean_y_train)
#test the predictions
predictions = rf.predict(minmax_clean_X_test)
In [64]:
cf=confusion_matrix(minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8047529492220893
AUC     : 0.6507218639840668
Precision: 0.6323907455012854
Recall: 0.3649851632047478
F1_score: 0.46284101599247407
In [65]:
rf=RandomForestClassifier(n_jobs=10,random_state=20, n_estimators=20, verbose=False)
rf.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
#test the predictions
predictions = rf.predict(oversample_minmax_clean_X_test)
In [66]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.9311424100156495
AUC     : 0.9317341686976912
Precision: 0.9075886062187432
Recall: 0.9562428407789233
F1_score: 0.9312806782686301

5.4.1 Random Forrest - Findings

  • mimax scaled data same model accurace as orignal data.
  • upsampled data do way better with random forest model than other data
  • Maximum accuracy 0.93
In [67]:
from xgboost import XGBClassifier
In [68]:
# fit model no training data
xgb = XGBClassifier(use_label_encoder=False,
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=100,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
xgb.fit(clean_X_train, clean_y_train)
#test the predictions
predictions = xgb.predict(clean_X_test)
[17:02:49] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
In [69]:
cf=confusion_matrix(clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.7987690203453581
AUC     : 0.6486526732931752
Precision: 0.6033857315598549
Recall: 0.3701780415430267
F1_score: 0.4588505747126437
In [70]:
# fit model no training data
xgb = XGBClassifier(use_label_encoder=False,
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=100,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
xgb.fit(minmax_clean_X_train, minmax_clean_y_train)
#test the predictions
predictions = xgb.predict(minmax_clean_X_test)
[17:03:26] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
In [71]:
cf=confusion_matrix(minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.8003077449136604
AUC     : 0.6517311187688591
Precision: 0.6079136690647482
Recall: 0.37611275964391694
F1_score: 0.464711274060495
In [72]:
# fit model no training data
xgb = XGBClassifier(use_label_encoder=False,
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=100,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
xgb.fit(oversample_minmax_clean_X_train, oversample_minmax_clean_y_train)
#test the predictions
predictions = xgb.predict(oversample_minmax_clean_X_test)
[17:04:04] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
In [73]:
cf=confusion_matrix(oversample_minmax_clean_y_test, predictions)
sns.heatmap(cf, annot=True, fmt='.0f', cmap="YlGnBu").set_title('Confusion Matrix')
accuracy = accuracy_score(y_true = oversample_minmax_clean_y_test, y_pred = predictions)
print(f'Accuracy: {accuracy}')
auc = roc_auc_score(oversample_minmax_clean_y_test, predictions)
print(f'AUC     : {auc}')
precision, recall, f1_score, _ = precision_recall_fscore_support(oversample_minmax_clean_y_test, predictions, average = 'binary')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1_score}')
Accuracy: 0.9308070646098815
AUC     : 0.9314877452138709
Precision: 0.9043609671848014
Recall: 0.9596792668957618
F1_score: 0.9311992886517729

5.5.1 XGBoost - Findings

  • original data and minmax data has same accuracy with xdboost
  • Upsampled data worked very good with xgboost
  • more max depth more accuracy
  • maximum accuracy 0.93, we c an get more acuuracy but with more hyperparameter tuning

------------------------------------------------------------------------------------------------

6. Conclusion

  • Feature Engineering helped me alot to improve model accuracy
  • balanced data way better than imbalanced data in model train/test
  • Accuracy from best to worst (Random Forest > XGBoost > KNN > Decision Tree > Logistic Regression)

7. Improvement Ideas

  • Use SMOTE oversampling technique rather than sklearn resample
  • use hyperopt to tune models hyperparameter